Nature Biotechnology

Springer Science and Business Media LLC

Preprints posted in the last 30 days, ranked by how well they match Nature Biotechnology's content profile, based on 147 papers previously published here. The average preprint has a 0.35% match score for this journal, so anything above that is already an above-average fit.

1
DAMPA - accelerated and simplified design of probe panels for targeted metagenomics using pangenome graphs

Payne, M.; Tam, K. K.-G.; Rockett, R. J.; Basile, K.; Bowden, R.; Sintchenko, V.; Kok, J.; Golubchik, T.

2026-05-22 infectious diseases 10.64898/2026.05.15.26352859 medRxiv
Top 0.1%
42.5%

Targeted metagenomics, where samples are enriched for multiple organisms of interest using oligonucleotide probes, is a highly efficient sequencing methodology that is becoming standard practice for genomics of viruses and complex polymicrobial samples. Efficient enrichment critically requires probes that capture both conserved and highly diverse genomic regions without loss of sensitivity, and with uniform representation in the sequencing pool. Design of optimal probesets poses a challenge: existing computational methods use k-mer hashing to reduce over-abundant sequences, but scalability and efficiency drop with increasing numbers of genomes, while diverse sequences remain under-represented. Here we show that incorporating evolutionary distance to compress probes via a graph-based representation of multiple genomes across species, together with k-mer hashing, reduces overrepresentation of conserved sequences, and yields more uniform coverage even of highly diverse loci. We make the method available in Dampa, an open-source tool that generates probesets in seconds on a standard laptop.

2
upSPLAT: Early-Barcoded Library Preparation for Cost-Effective Population-Scale Genomics

Raine, A.; Daniels, R. J.; Kjellin, J.; Wiman, A.-C.; Liljedahl, U.; Ramsell, J.; Wheat, C. W.; Gotthard, K.; Pettersson, M. E.; Andersson, L.; Nordlund, J.

2026-05-13 genomics 10.64898/2026.05.09.723775 medRxiv
Top 0.1%
28.4%

Advances in high-throughput sequencing have substantially reduced sequencing costs, yet library preparation remains a major financial and logistical bottleneck, particularly for high-throughput applications or low-quality DNA inputs. Here, we introduce upscaled Splinted Ligation Adapter Tagging (upSPLAT), a library preparation strategy that combines early sample barcoding with single-stranded splinted ligation to enable highly multiplexed pooled sequencing at substantially reduced cost. upSPLAT supports flexible high-plex pooling and reduces per-sample library preparation costs by approximately 10-fold compared to conventional workflows. By leveraging single-strand ligation, upSPLAT is compatible with a wide range of DNA inputs, including degraded, damaged or denatured double-stranded DNA, bisulfite or enzymatically converted DNA, and viral single-stranded DNA. We present two complementary workflows and evaluate their performance across multiple species and DNA qualities, demonstrating robust demultiplexing, uniform sample representation, and low barcode cross-assignment. Together, upSPLAT provides a scalable, cost-effective solution for sequencing-based studies requiring large sample numbers while preserving individual-level information.

3
ZipStrain Enables Rapid and Precise Strain-Resolved Metagenomics

Ghadermazi, P.; Emerson, J. B.; Olm, M. R.

2026-05-22 bioinformatics 10.64898/2026.05.20.726564 medRxiv
Top 0.1%
25.4%

Strain-resolved metagenomics characterizes microbial communities at nucleotide-level resolution, enabling researchers to differentiate identical from closely related organisms and characterize population structure and gene content variation. Here we introduce ZipStrain, a program that performs highly accurate strain-resolved metagenomics over 500x faster than available methods while offering superior RAM management. Applied to a dataset of 2,754 samples spanning human populations, we identify a strain-sharing gradient across social relationships, reveal striking variation in clonal structure across bacteria and bacteriophage, and pinpoint genes whose nucleotide identity deviates from genome-wide expectations. ZipStrain is distributed as an open-source Python package and accompanying Nextflow pipeline at https://github.com/OlmLab/ZipStrain.

4
NanoCortex: A Unified Agentic System for Nanopore Sequencing Analysis

Xia, Q.; Wang, Z.; Shokoufandeh, M.; Rouhanifard, S. H.; Wanunu, M.

2026-05-21 bioinformatics 10.64898/2026.05.19.726254 medRxiv
Top 0.1%
23.2%

Nanopore sequencing has enabled access to various layers of information about DNA and RNA sequence isoforms and chemical modifications. Yet, the archipelago of disjoint nanopore analysis tools makes navigating among these a significant challenge for the nanopore user. We present NanoCortex, a unified autonomous agentic framework designed to bridge this shortcoming by providing end-to-end data processing which ranges from raw signal basecalling to biological interpretation. Built upon Gemini API services that incur usage-based API costs and orchestrated through the Gemini Agent Development Kit (ADK), the system utilizes a multi-agent architecture to autonomously perform task parsing, code generation, iterative code-level self-correction, and scientific interpretation. Following code generation, the code can be used offline. Benchmarking reveals that NanoCortex achieves significantly higher usability across complex analytical tasks compared to general-purpose large language models. The framework seamlessly integrates experimental data with meta-analysis of publicly available biological databases to facilitate the extraction of biologically meaningful insights from sequencing data without cumbersome computational steps.

5
NanoLabel: A fast and accurate real-time nanopore signal classifier

Mahajan, D.; Jain, C.; Kashyap, N.

2026-05-06 genomics 10.64898/2026.05.03.722500 medRxiv
Top 0.1%
22.8%

Oxford Nanopore Technologies' adaptive sampling capability promises to reduce sequencing cost and turnaround time. At its core, adaptive sampling is a real-time classification problem that distinguishes reads originating from regions of interest. Direct signal-based classification approaches bypass the computational bottleneck of basecalling and can eliminate the need for powerful GPUs. However, operating directly on noisy raw signals remains challenging in real-time settings, where classification decisions must be made quickly. In this work, we propose NanoLabel, a new method for real-time classification of nanopore signals. We build NanoLabel on top of the signal-based read mapping tool RawHash2. We accelerate the classification workflow by mapping reads using only the target regions as the reference. To further improve accuracy, we train a lightweight classifier on mapping-derived features and introduce a data augmentation strategy to construct sufficiently large and class-balanced training datasets. We evaluate NanoLabel using publicly available real sequencing datasets from three human genomes (HG001, HG002, and HG005), while assuming a cancer gene panel as the target. Compared to directly mapping reads with RawHash2, we demonstrate an 80x improvement in classification time and a 0.10-0.25 unit improvement in the F1 score.

6
Programmable Repair of Disease-Causing UGA Stop Codons in Mammalian Brain

Al Saneh, A.; Gissot, L.; Ahern, C. A.

2026-05-16 neuroscience 10.64898/2026.05.13.724978 medRxiv
Top 0.1%
22.7%

Protein truncating variants caused by UGA stop codons are the most prevalent class of rare variant mutations in neurodevelopmental diseases. Suppressor transfer RNAs (sup-tRNAs) have therapeutic potential for premature termination codon (PTC) repair, but have thus far underperformed with traditional AAV delivery platforms, and progress has been hampered by the lack of methods to non-invasively assess in vivo activity in mammalian brain. To fill this material gap, we utilize transcranial in vivo bioluminescence imaging data from a luciferase-UGA mouse model to enable payload optimization. These data demonstrate that U6 promoter and AAV2/9 capsids have the lowest in vivo activity, whereas self-complementary AAV2/9 with the tRNA in a minimal 100 bp genomic context provides broad and efficacious PTC rescue. Further, payload tRNA multiplexing and use of tRNA introns enable efficacy of low viral titers and sustained rescue. tRNA sequencing of scAAV delivered ArgUGA sup-tRNA in brain demonstrates no effects on endogenous tRNA levels, their acylation or processing, and these features are also maintained in scAAV delivered ArgUGA sup-tRNA. Collectively, this work defines a scalable strategy for precision UGA stop codon suppression, supporting development of durable genetic rescue therapies for neurodevelopmental disorders in the mammalian brain.

7
Dogcatcher2: Improved statistical detection of transcriptional readthrough and repetitive element analysis across sequencing platforms

Melnick, M.; Link, C. D.

2026-05-12 bioinformatics 10.64898/2026.05.07.723642 medRxiv
Top 0.1%
22.7%

Downstream of Gene (DoG) transcription occurs when RNA polymerase II fails to terminate normally at the transcription end site, resulting in extended transcription downstream of the gene. This is a widespread phenomenon linked to cellular stress, cancer and neurodegeneration. Existing tools for DoG detection from short-read RNA-seq rely on absolute coverage thresholds and sliding window approaches that are sensitive to sequencing depth and expression level. Here we present Dogcatcher2, which applies improved statistical detection methods to gene body-normalized coverage profiles. Using long-read ground truth across multiple datasets, we show that Dogcatcher2 outperforms existing methods in both detection sensitivity and boundary accuracy while maintaining high precision even at low sequencing depths. Dogcatcher2 further improves detection on pseudobulk scRNA-seq and snRNA-seq data. Analysis of DoG regions in human reveals specific enrichment for Alu elements including inverted Alu pairs capable of forming double-stranded RNA, with transposable elements within DoG regions showing elevated expression, connecting readthrough transcription to dsRNA generation and innate immune signaling.

8
BARseq3: a modular system for integrating spatial multi-omics and cellular barcoding in single cells

Qi, H.; Anant, M. M.-G.; Faltine-Gonzalez, D. Z.; Hu, R.; Wei, L.; Workman, C. D.; Shi, C.; Del Rosario, I.; Kebschull, J. M.

2026-05-16 genomics 10.64898/2026.05.13.724900 medRxiv
Top 0.1%
22.5%

Understanding cellular identity requires multimodal measurements in single cells. Cellular barcoding provides powerful tools for recording the properties or history of individual cells in nucleic acids, while spatial omics techniques enable the measurement of a growing list of molecular features at micron resolution in tissue. However, existing methods that integrate these approaches in single samples are limited in the modalities they support, their flexibility, and efficiency. Here, we present BARseq3, a modular system that combines cellular barcoding with high-efficiency spatial transcriptomics and translatomics at subcellular resolution in tissue. BARseq3 is compatible with fixed samples, immunostaining, diverse species, and can be easily extended to include other spatial assays, enabling a multimodal understanding of cellular identity.

9
The human gut virome is a non-redundant and clinically informative component of the microbiome

Yang, Y.; Huang, D.; Korzenik, J. R.; Weiss, S. T.; Liu, Y.-Y.; Sun, Z.

2026-05-15 bioinformatics 10.64898/2026.05.13.724676 medRxiv
Top 0.1%
22.5%

The gut virome represents a vast reservoir of genetic diversity with profound implications for human health, yet it remains the "dark matter" of the microbiome due to the staggering complexity of reproducible viral profiling. It remains fundamentally contested whether biologically informative virome signals can be robustly recovered from routine whole-metagenome sequencing (WMS), and to what extent these signals offer ecological insights independent of the bacteriome. Here we present VIP2B, a framework that leverages Type IIB restriction tags to extract multifaceted viral features (encompassing taxonomy, coverage, function, and phenotype) directly from bulk WMS data. Through extensive benchmarking across incomplete references, unseen genomes, and high bacterial or host background, we demonstrate that VIP2B achieved high precision and robust taxonomic concordance. By applying VIP2B to paired bulk and virus-like particle (VLP)-enriched datasets, we reveal a species-level overlap far greater than previously recognized, proving that standard bulk metagenomes contain a wealth of recoverable viral information. Analysis of 20 clinical cohorts demonstrates that coverage-, function-, and phenotype-resolved viral features consistently identify disease-associated signatures that escape taxonomic analysis alone, significantly improving diagnostic models over bacteriome-only approaches. Finally, we define two distinct gut virome community states at the population scale (n=6,090), characterized by divergent diversity profiles and health associations. Our findings establish the gut virome as a non-redundant, clinically actionable component of the human holobiont and provide the methodology necessary to transition microbiome research toward a truly multi-kingdom framework.

10
ProtSpace: Protein Universe in Your Browser

Senoner, T.; Vahidi, P.; Olenyi, T.; Senoner, F.; Sisman, G.; Kahl, E.; Rost, B.; Koludarov, I.

2026-05-07 bioinformatics 10.64898/2026.05.04.722720 medRxiv
Top 0.1%
22.2%

Protein Language Models (pLMs) generate per-protein embeddings that encode functional, structural, and evolutionary information, yet the relationships captured in these representations remain difficult to explore systematically. ProtSpace (https://protspace.app) is a web application for interactive visualization of pLM embedding spaces, enabling hypothesis generation directly in the browser without installation. Unlike traditional network-based tools that exclusively visualize amino acid sequence similarity, ProtSpace explores embedding spaces, revealing relationships often not captured by traditional comparisons. Users provide protein sequences or pre-computed embeddings through a Google Colab notebook or the Python CLI; the pipeline applies dimensionality reduction, retrieves 38 annotation types spanning UniProt, InterPro, NCBI Taxonomy, TED structural domains, and sequence-based predictors served via Biocentral, and produces a portable binary file for the browser-based viewer. WebGL-accelerated rendering supports interactive exploration of over 570,000 proteins. Distinctive features include per-point pie charts for multi-label annotations and integrated 3D structure viewing through AlphaFold2 predictions. All computation happens on the user's machine, ensuring data privacy. We demonstrate the utility of ProtSpace through a progressive zoom-in across biological scales: from global proteome organization of Swiss-Prot, through cross-species comparison revealing conserved and lineage-specific families, to functional hypothesis generation within the beta-lactamase superfamily. ProtSpace is freely available at https://protspace.app under the Apache 2.0 license.

Key points:
- ProtSpace is a free, open-source web application that visualizes protein Language Model (pLM) embeddings as interactive maps, scaling to 570,000 proteins entirely client-side.
- A zero-installation Google Colab notebook and a Python CLI prepare visualization-ready bundles from FASTA files, UniProt queries, or pre-computed HDF5 embeddings, automatically retrieving 38 annotation types from five sources (UniProt, InterPro, NCBI Taxonomy, TED structural domains, and Biocentral sequence predictors) alongside custom CSV metadata.
- Application examples demonstrate that embedding visualizations generate testable biological hypotheses at multiple scales, from proteome-wide organization through species-level comparison to family-level functional discovery, and that these are complementary to traditional sequence-based analyses.

11
Scalable genotyping in fixed transcriptomes resolves clonal heterogeneity via single-cell sequencing

Blattman, S. B.; Maslah, N.; Varela, A. A.; Kumpaitis, K.; Nalbant, B.; Snopkowski, C.; Mariani, M.; Kida, L. C.; Takizawa, M.; Ratnayeke, N.; Yu, K. K. H.; Fernandes, S.; Mousavi, N.; Borgstrom, E.; Vallejo, D.; Boghospor, L.; Xin, R.; Mignardi, M.; Wu, S.; Scarlott, N.; Delgado-Rivera, L.; Kumar, P.; Krishnan, S.; Giraudier, S.; Kiladjian, J.-J.; Howitt, B. E.; Kohlway, A.; Lund, P.; Pe'er, D.; Chaligne, R.; Lareau, C. A.

2026-05-10 genomics 10.64898/2026.04.11.717967 medRxiv
Top 0.2%
18.9%

Despite the promise of single-cell transcriptomics for understanding cell states in heterogeneous populations, widely used platforms have limited ability to link transcriptional states to somatic mutations within the same cells. Here, we introduce Genotyping in Fixed Transcriptomes (GIFT) for the simultaneous detection of large numbers of targeted genetic variants with whole transcriptome profiles in single cells. The core innovation of GIFT is a rationally designed gapfilling reaction between adjacent single-stranded DNA (ssDNA) probes that barcodes native transcript sequence to enable highly specific targeted mutation detection. GIFT achieves greater than 99% genotyping accuracy and flexible capture of hundreds of mutations per cell, including in formalin-fixed, paraffin-embedded (FFPE) tissue, enabling clonal lineage tracing in heterogeneous settings. We demonstrate the unique scalability of GIFT by profiling more than 700,000 cells from 35 donors with myeloproliferative neoplasms (MPN), revealing mutation-dependent hematopoietic responses to systemic inflammation associated with the characteristic JAK2V617 mutation, including an allelic dose gradient of interferon-associated transcriptional programs and priming of hematopoietic stem cells that develop into divergent disease states. The technical advantages of GIFT enable direct resolution of genotype-to-phenotype relationships via clonal tracing with comprehensive cell-state measurements at single-cell resolution.

12
NANOTAXI: A Shiny-Based GUI for Real-Time Classification and Analysis of 16S rRNA Nanopore Reads

Mahar, N. S.; Chouhan, K.; Gupta, I.

2026-05-20 bioinformatics 10.64898/2026.05.17.725747 medRxiv
Top 0.2%
18.7%

Real-time taxonomic classification of nanopore amplicon sequencing data enables rapid insights into microbial communities, with applications in clinical diagnostics, environmental monitoring, and outbreak surveillance. However, bridging the gap between long-read data and interpretable results often requires specialised bioinformatics expertise. There remains a need for integrated, user-friendly software that combines live data acquisition with downstream microbiome analysis. Here we present NANOTAXI, a fully automated Shiny-based GUI for the classification of barcoded 16S rRNA gene sequences generated by Oxford Nanopore sequencing. The platform supports four taxonomic classifiers, integrated with five reference databases, enabling flexible selection of classification strategies based on user requirements and available computational resources. In addition to real-time monitoring, NANOTAXI performs cohort-level analyses, including alpha and beta diversity, ordination, differential abundance testing, and functional inference using PICRUSt2. Validation using barcoded synthetic communities comprising pooled genomic DNA from clinically relevant bacterial species and the ZymoBIOMICS mock community demonstrated that NANOTAXI generated biologically coherent taxonomic and functional profiles. Benchmarking revealed clear trade-offs between computational performance and taxonomic specificity. Emu provided the lowest observed species-level false-positive rate, whereas Kraken2 offered the fastest classification and enabled continuous near-real-time monitoring across all tested databases. NANOTAXI is open source and freely available at https://github.com/Nirmal2310/NANOTAXI under the GPL version 3 license.

13
Atlas-Level Single-Cell and Spatial Transcriptomics Data Integration via PRIME

Wu, X.; Wang, X.; Wang, J.; Wan, S.

2026-05-23 bioinformatics 10.64898/2026.05.20.726698 medRxiv
Top 0.2%
18.7%

Single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (ST) have enabled atlas-scale cellular cartography, with consortium efforts now assembling millions of cells across diverse tissues, donors, and technologies to build comprehensive references for cell identity and disease mechanism, yet the scientific value of these atlases hinges on robust computational integration across heterogeneous data sources. Unlike pairwise batch correction, atlas-level integration must jointly reconcile heterogeneous and often hierarchically nested batch effects across many datasets whose cell-type compositions are highly imbalanced, all while preserving subtle biological variation and remaining computationally tractable at the scale of millions of cells. Existing approaches often prioritize either batch mixing or preservation of local biological structure, and most cannot natively accommodate spatial coordinates. Here we introduce PRIME (Projection-based Robust Integration via Manifold Embedding), an ensemble integration framework that combines random-projection-based consensus anchoring, graph-Laplacian correction, and optional spatial-neighborhood regularization. Across multiple random projections of the expression manifold, PRIME uses consensus voting to keep only cell pairs that are repeatedly matched, reducing false anchors caused by projection-specific distortions. For ST, PRIME couples this expression-based anchor graph with a coordinate-derived spatial neighborhood graph in a unified graph-Laplacian objective with closed-form solution, enabling simultaneous cross-batch alignment and local spatial coherence. Based on extensive benchmarking spanning diverse datasets, we show that PRIME consistently outperforms state-of-the-art methods in both batch correction and biological conservation across scRNA-seq and ST integration scenarios and downstream tasks including trajectory inference, spatial-domain preservation, and perturbation-response analysis. In particular, when integrating a human hematopoiesis benchmark spanning eight donors and approximately 33,000 cells, PRIME preserves biologically coherent developmental trajectories. It also maintains cortical laminar architecture across dorsolateral prefrontal cortex sections in an ST dataset and recovers known drug-target relationships in a perturbation atlas of more than 1 million cells while suppressing batch-associated confounders. Together, these results establish PRIME as a versatile and scalable framework for atlas-level integration of scRNA-seq and ST across diverse biological applications.
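
The consensus-voting idea in the PRIME abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how cell pairs are matched within a projection, so mutual nearest neighbours are assumed here, and every function name, the Gaussian projection matrix, and the `min_frac` vote threshold are hypothetical choices made for illustration.

```python
import random
from collections import Counter


def project(points, matrix):
    """Project each point onto the rows of a random matrix."""
    return [[sum(x * w for x, w in zip(p, row)) for row in matrix] for p in points]


def nearest(query, refs):
    """Index of the reference point closest to query (squared Euclidean distance)."""
    return min(range(len(refs)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(query, refs[j])))


def mutual_pairs(a, b):
    """Pairs (i, j) where a[i] and b[j] are each other's nearest neighbour (assumed matching rule)."""
    a_to_b = [nearest(p, b) for p in a]
    b_to_a = [nearest(q, a) for q in b]
    return {(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i}


def consensus_anchors(batch_a, batch_b, n_proj=20, out_dim=2, min_frac=0.6, seed=0):
    """Keep only pairs matched in at least min_frac of the random projections."""
    rng = random.Random(seed)
    dim = len(batch_a[0])
    votes = Counter()
    for _ in range(n_proj):
        # Hypothetical choice: a dense Gaussian random projection.
        matrix = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(out_dim)]
        votes.update(mutual_pairs(project(batch_a, matrix), project(batch_b, matrix)))
    return {pair for pair, count in votes.items() if count >= min_frac * n_proj}


# Toy data: batch_b is batch_a plus a small, made-up batch shift.
batch_a = [[0.0, 0.0, 0.0], [5.0, 1.0, 0.0], [0.0, 4.0, 7.0]]
batch_b = [[x + 0.1 for x in p] for p in batch_a]
anchors = consensus_anchors(batch_a, batch_b)
```

A pair that matches only because one projection happens to collapse two distant cells together collects few votes and is dropped, which is the filtering effect the abstract describes.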

14
A universal taxonomic and functional human gut microbiome model for disease classification and phenotype discovery

Karwowska, Z.; Mozejko, M.; Nowak, W.; Romanchenko, A.; Szczurek, E.; Kosciolek, T.

2026-05-05 bioinformatics 10.64898/2026.04.30.721924 medRxiv
Top 0.2%
18.7%

The human gut microbiome is a powerful indicator of host health, yet its compositional nature, high sparsity, and inter-individual variability complicate downstream analysis. Here, we introduce two complementary approaches to characterize gut microbiome structure at population scale. First, we define eight functional signatures of the human gut microbiome using Non-negative Matrix Factorization, revealing coordinated metabolic patterns that partially decouple from taxonomic composition. Second, we present GUT-FORMer, a transformer-based autoencoder that jointly models taxonomic and functional metagenomic profiles from close to 21,000 publicly available samples. The learned latent representations capture biologically meaningful structure, reflect geographic and disease-associated variation, and enable accurate classification of 25 diseases in both binary and multiclass settings, as well as regression of host age and BMI. GUT-FORMer outperforms existing microbiome indices and deep learning methods across all tasks, establishing a generalizable framework for microbiome-based precision medicine.

15
The Second Brain: Diffusion Models for Realistic Human Microbiome Generation

Yee, B.; Fu, J.

2026-05-11 bioinformatics 10.64898/2026.05.07.723523 medRxiv
Top 0.2%
18.6%

The human microbiome is a critical determinant of health and disease, but microbiome machine learning is constrained by limited data availability, heterogeneous cohort coverage, and privacy risks from individually identifying microbial signatures. Synthetic microbiome generation could support method development and privacy-preserving sharing, provided that generated samples preserve the ecological zero-inflation of real communities. We present a diffusion-based generative model with a sparsity-preserving decoder built around two sparsity-focused mechanisms: (1) prevalence-aware bias initialization that anchors per-taxon presence probabilities to observed prevalences from epoch one; and (2) a hard sparsity loss implemented with straight-through gradient estimators. The implementation also uses hyperbolic taxonomic embeddings as an unvalidated, phylogeny-aware architectural prior in the diffusion backbone. Evaluated on the American Gut Project (4,827 samples, 500 taxa), the full 15.2M-parameter model achieves parametric-level sparsity preservation: 1.4% deviation in the main comparison and 2.6% ± 0.5% deviation across three AGP seeds. SparseDOSSA2 achieves the lowest sparsity deviation in this comparison (0.7%), and MIDASim also passes the operational sparsity threshold (4.9%). Among the three threshold-passing methods, MIDASim achieves the best ecological distance scores, SparseDOSSA2 is best on sparsity deviation, and our model achieves the best prevalence correlation (0.996) while narrowly improving on SparseDOSSA2 on Bray-Curtis (0.0485 vs. 0.0495) and UniFrac (0.0400 vs. 0.0435) discrepancies. PERMANOVA remains able to distinguish generated from real AGP samples (F = 64.29), which we treat as an important limitation rather than evidence of indistinguishability. These results support a deliberately narrow conclusion: this is, to our knowledge, the first deep generative model to match parametric-level sparsity preservation for human microbiome profiles while remaining competitive on standard ecological distance metrics.
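
One plausible reading of "prevalence-aware bias initialization" is sketched below, assuming the decoder emits a per-taxon presence logit that is passed through a sigmoid. The abstract does not give the exact formulation, so the logit mapping, the function names, and the example prevalences are all assumptions. Setting each bias to the logit of the observed prevalence makes the initial presence probability equal that prevalence when the rest of the network contributes zero.

```python
import math


def prevalence_to_bias(prevalence, eps=1e-6):
    """Return logit(p) for each taxon, clipping p away from 0 and 1."""
    biases = []
    for p in prevalence:
        p = min(max(p, eps), 1.0 - eps)
        biases.append(math.log(p / (1.0 - p)))
    return biases


def sigmoid(x):
    """Logistic function mapping a logit back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Made-up prevalences: fraction of samples in which each taxon is present.
observed = [0.95, 0.40, 0.02]
biases = prevalence_to_bias(observed)

# With zero input from the network, the presence probabilities at
# initialization reproduce the observed prevalences.
initial_presence = [sigmoid(b) for b in biases]
```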

16
Automated Multimodal Correlative Registration for Organelle-Specific Molecular Imaging

Lu, C.; Zhao, K.; Cui, D.; Chen, G.; Yang, Q.; Yang, H.; Zhao, M.; Song, K.; Nikan, M.; Li, Z.; Zhao, S.; Cen, J.; Qiu, X.; Young, S.; Bennett, C. F.; Seth, P.; Chen, K.; Qi, X.; Jiang, H.

2026-05-04 bioinformatics 10.64898/2026.04.30.721814 medRxiv
Top 0.2%
18.4%

Mapping subcellular drug distribution is essential for understanding trafficking and off-target effects. NanoSIMS enables chemical imaging of labeled therapeutics, but signal interpretation requires ultrastructural correlation with electron microscopy, a manual and laborious process. We present an automated AI-driven pipeline for correlating chemical and ultrastructural images, enabling multiscale, organelle-precise imaging of molecules in cells and tissues. The method integrates bidirectional optical flow, confidence-guided affine transformation, and automated template matching for cross-scale EM alignment. Morphology-rich ion channels (e.g., 32S) estimate transformations that propagate to sparse therapeutic signals (e.g., 79Br, 15N), overcoming low signal-to-noise challenges. We validate this framework across diverse cell and tissue types, tracking oligonucleotide and antibody therapeutics in vitro and in vivo to reveal cell-type- and organelle-specific distribution patterns. This work establishes a generalizable platform for automated multimodal registration and organelle-resolved subcellular pharmacology.

17
Clonal embeddings allow exploratory analysis of lineage-resolved single-cell data

Isaev, S.; Erickson, A. G.; Adameyko, I.; Kharchenko, P. V.

2026-05-05 bioinformatics 10.64898/2026.04.30.720820 medRxiv
Top 0.2%
18.4%

Assays coupling high-throughput lineage tracing with single-cell transcriptomics are transforming studies of development and disease biology, revealing not only major differentiation routes but also continuous fate biases and their putative regulators. Yet, analysis of such data at scale presents challenges due to the sparse nature of clonal data and annotation dependencies. To that end, we developed a machine learning approach, clone2vec, which learns informative clone embeddings directly from the cellular expression manifold, bypassing discrete cell-type labels and remaining stable when clones are represented by few cells. This representation summarizes clonal variation as an interpretable geometry that supports exploration, statistics for clone-gene associations, and cross-dataset alignment. In prospective barcoding datasets spanning embryogenesis, tumorigenesis, and hematopoiesis, clone2vec recapitulates established clonal patterns and uncovers new axes of continuous variation that implicate regulatory programs and developmental pathways. In tumor microenvironments profiled with TCR sequencing, clone2vec robustly recovers distinct Treg lineages as well as conserved CD8+ T cell sublineages across cancer types, including several bystander-like clonal subsets. Overall, clone2vec provides a robust, general solution for the exploratory analysis of lineage-coupled scRNA-seq data.

18
A layered standards framework for integrating single-cell and spatial omics data into brain cell atlases

Ray, P. L.; Miller, J. A.; Jarecka, D.; Smith, K. A.; Baker, P. M.; Ng, L.; Martone, M. E.; Trivedi, P.; Abeysinghe, R.; Anderson, L.; Bandrowski, A. E.; Edyta, V.; Bhandiwad, A. A.; Chhetri, T. R.; Cui, L.; Giglio, M.; Goldy, J.; Hong, N.; Huang, H.; Huang, Y.; Hussain, Y.; Johansen, N.; Kenney, M.; Kruse, L.; Li, X.; Meldrim, J.; Mollenkopf, T.; Nadendla, S.; Osumi-Sutherland, D.; Sanchez, R.; Scheuermann, R. H.; Tao, S.; Vanderburg, C. R.; Yang, Y.; Ropelewski, A.; Mufti, S.; Lein, E.; Xu, H.; Zheng, W. J.; Ghosh, S. S.; White, O.; Hawrylycz, M.; Zhang, G.-Q.; Thompson, C. L.

2026-05-04 genomics 10.64898/2026.04.30.722039 medRxiv
Top 0.2%
18.2%

The BRAIN Initiative Cell Atlas Network (BICAN) is generating large-scale multimodal datasets to profile cell types in the human, non-human primate, and mouse brain. The diversity of single-cell and spatial transcriptomic and epigenomic assays, combined with varied experimental contexts, multiple data-generating laboratories and distributed infrastructure, poses substantial challenges for data integration and reuse in BICAN. To address this, we implemented a standards framework that enables layered integration of these data into knowledge-ready products for interoperable brain cell atlases. This framework organizes data based on three progressively structured layers. First, we introduced an assay-agnostic modeling layer that unifies the representation of single-cell and spatial omics data using a common set of biological entities and processes assessed by diverse experimental techniques. Second, we implemented harmonized metadata standards that capture key experimental features linked to biospecimen provenance across heterogeneous tissue sources, species, and preparations, supporting integration and validation while minimizing burden on data contributors. Third, we present an extensible representation for data-driven cell type taxonomies that integrates molecular data with annotations, ontology mappings, and evidence. Together, these contributions represent an end-to-end framework that transforms heterogeneous datasets into structured, interoperable resources that support broad community reuse via mapping algorithms, annotation systems, and visualization platforms. This approach links biospecimen provenance with cell-level outputs and embeds these in a standardized taxonomy format, enabling downstream applications such as cross-dataset integration, reference mapping, and knowledge-driven analysis. More broadly, our work demonstrates a generalizable strategy for enabling an efficient data-to-knowledge pipeline in a large-scale consortium setting.

19
DeSpotX: Identifiability-Based Decontamination for Spatial Transcriptomics

Wang, R. H.; Gentles, A. J.

2026-05-14 bioinformatics 10.64898/2026.05.12.724704 medRxiv
Top 0.2%
18.2%

Spatial transcriptomics (ST) at single-cell resolution profiles gene expression in its native spatial context, but a substantial fraction of transcripts contaminate neighboring cells, compromising downstream biological analyses. Existing decontamination methods rely on heuristic priors and either ignore the spatial structure of contamination or aggregate over neighbors without separating contamination from native expression, leaving the decomposition ambiguous. To resolve this ambiguity, we introduce DeSpotX, a deep generative model that uses anchor genes, defined as genes not natively expressed in a given cell cluster, to constrain the contamination decomposition and make it identifiable. DeSpotX further uses spatial information to estimate contamination locally through a cluster-masked, distance-weighted average over neighboring cells, and prevents over-correction of low-expression signal through a learned diffusion prior. On spike-in simulations across five datasets and four ST platforms, DeSpotX achieves AUROC > 0.94 on every dataset, with gains of 0.02 to 0.12 over the best baseline, and remains robust to inaccuracies in the cell-cluster annotation and in anchor gene construction. On real tissues, we show that the decontaminated counts produce improved marker-gene specificity, more spatially coherent expression, and cell-cell communication networks consistent with known biology. We further show that iterating decontamination and cell-cluster annotation refines these outcomes, reassigning ligand-receptor signaling to the expected source cells in mouse brain and breast cancer tissues.
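
The "cluster-masked, distance-weighted average over neighboring cells" mentioned in the abstract can be illustrated with a minimal sketch. This is not the DeSpotX implementation. The function name, the Gaussian distance kernel, the `bandwidth` parameter, and the reading of "cluster-masked" as "exclude neighbors from the same cluster" are all assumptions made for illustration only.

```python
import math


def local_contamination_profile(coords, counts, clusters, bandwidth=10.0):
    """Hypothetical sketch: per-cell local contamination estimate.

    coords   : list of (x, y) cell positions
    counts   : list of per-cell gene count lists (same gene order for all cells)
    clusters : list of cluster labels, one per cell
    bandwidth: Gaussian kernel width (assumed; not taken from the abstract)
    """
    n_genes = len(counts[0])
    profiles = []
    for i, (xi, yi) in enumerate(coords):
        weighted_sum = [0.0] * n_genes
        total_weight = 0.0
        for j, (xj, yj) in enumerate(coords):
            # Cluster mask (assumed meaning): skip the cell itself and
            # any neighbor that belongs to the same cluster.
            if j == i or clusters[j] == clusters[i]:
                continue
            dist_sq = (xi - xj) ** 2 + (yi - yj) ** 2
            # Closer neighbors contribute more to the estimate.
            weight = math.exp(-dist_sq / (2.0 * bandwidth ** 2))
            total_weight += weight
            for g in range(n_genes):
                weighted_sum[g] += weight * counts[j][g]
        if total_weight > 0:
            profiles.append([v / total_weight for v in weighted_sum])
        else:
            # No usable neighbors: fall back to an all-zero profile.
            profiles.append([0.0] * n_genes)
    return profiles


# Toy example: three cells, two genes, two clusters.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
counts = [[10, 0], [0, 8], [6, 0]]
clusters = ["A", "B", "A"]
profiles = local_contamination_profile(coords, counts, clusters)
```

In this toy example, the estimate for each cluster "A" cell is built only from the single cluster "B" cell, and the estimate for the "B" cell blends the two "A" cells, with the nearer one weighted more heavily.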

20
CAD-C: An engineered nuclease enables repair-free in situ proximity ligation and nucleosome-resolution chromosome walks in human cells

Soroczynski, J.; Westcott, L. A.; Zuo, W.; Ou, A.; Canaj, H.; Hickling, J.; Yeung, J. L.; Konishi, H. A.; Campbell, E. B.; Whelan, C.; Balacco, J.; Formenti, G.; Risca, V. I.

2026-05-04 genomics 10.64898/2025.12.22.695891 medRxiv
Top 0.2%
18.1%

Chromosome conformation capture (3C)-derived methods have become an indispensable tool in the study of gene regulation. The three-dimensional contacts probed by 3C methods depend strongly on the properties of the enzyme used to fragment chromatin prior to proximity-driven ligation. Micrococcal nuclease (MNase), used in Micro-C, increases resolution but suffers from low ligation efficiency and requires extensive enzyme titration. To overcome these limitations, we engineered a highly active, TEV protease-activatable caspase-activated DNase (CAD) to enable an efficient, low-sequence-bias, and high-resolution proximity ligation assay we call CAD-C. CAD-C was successful on the first attempt for each human cell line tested, and the resulting datasets capture loops, TADs, compartments, and stripes similarly to Micro-C. However, compared to Micro-C and Hi-C, CAD-C shows enhanced sensitivity for promoter-enhancer loops. Leveraging the ligation-competent DNA ends produced by CAD cleavage, we show that CAD-C is compatible with a highly streamlined, repair-free protocol and produces multi-step CADwalks: consecutive ligations between nucleosomal or sub-nucleosomal fragments. With these walks, we probe local chromatin fiber folding contacts, nucleosomal and sub-nucleosomal footprints, and long-range nuclear organization regimes in human cell lines. CAD-C is an efficient, robust chromatin structure assay that can span sub-nucleosomal to chromosomal length scales in a single experiment.